Project Description¶

In this project, we will explore data about penguins in Antarctica using unsupervised learning. Our goal is to discover hidden patterns and natural groupings (clusters) among the penguins based on their physical features.


Background¶

A team of researchers studying penguins in Antarctica has been collecting data, and they’ve asked for our help! They believe there are at least three penguin species—Adelie, Chinstrap, and Gentoo—but unfortunately, they did not record the species labels in the dataset.

We’ve been given the dataset in CSV format as:
penguins.csv


Dataset Information¶

Source:
Data collected by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, part of the Long Term Ecological Research Network.

Features Available:

Column              Description
culmen_length_mm    Length of the penguin's bill (mm)
culmen_depth_mm     Depth of the penguin's bill (mm)
flipper_length_mm   Length of the flipper (mm)
body_mass_g         Body mass (grams)
sex                 Penguin's sex (Male/Female)

Our Task¶

We will:

  1. Explore and process the data.
  2. Perform clustering analysis to identify natural groups of penguins.
  3. Decide on a reasonable number of clusters (hint: we expect ~3 species).
  4. Analyze and compare the average feature values for each cluster.

Final Outcome¶

By using unsupervised learning techniques like K-Means Clustering, we aim to:

  • Group penguins into distinct clusters.
  • Provide meaningful insights to the research team.
  • Possibly help them identify penguin species based on the cluster patterns.

Let’s dive in!

In [1]:
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
In [5]:
# Loading and examining the dataset
penguins_df = pd.read_csv("penguins.csv")
penguins_df.head()
Out[5]:
culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex
0 39.1 18.7 181.0 3750.0 MALE
1 39.5 17.4 186.0 3800.0 FEMALE
2 40.3 18.0 195.0 3250.0 FEMALE
3 36.7 19.3 193.0 3450.0 FEMALE
4 39.3 20.6 190.0 3650.0 MALE
In [6]:
penguins_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 332 entries, 0 to 331
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   culmen_length_mm   332 non-null    float64
 1   culmen_depth_mm    332 non-null    float64
 2   flipper_length_mm  332 non-null    float64
 3   body_mass_g        332 non-null    float64
 4   sex                332 non-null    object 
dtypes: float64(4), object(1)
memory usage: 13.1+ KB
In [7]:
# Convert the categorical 'sex' column into dummy/indicator variables
penguins_df = pd.get_dummies(penguins_df, dtype=int)
In [9]:
# Scale the variables so each feature contributes equally to the distance metric
scaler = StandardScaler()
X = scaler.fit_transform(penguins_df)
penguins_preprocessed = pd.DataFrame(data=X, columns=penguins_df.columns)
penguins_preprocessed.head()
Out[9]:
culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex_FEMALE sex_MALE
0 -0.903906 0.790360 -1.425342 -0.566948 -0.993994 0.993994
1 -0.830434 0.126187 -1.068577 -0.504847 1.006042 -1.006042
2 -0.683490 0.432728 -0.426399 -1.187953 1.006042 -1.006042
3 -1.344738 1.096901 -0.569105 -0.939551 1.006042 -1.006042
4 -0.867170 1.761074 -0.783164 -0.691149 -0.993994 0.993994
In [10]:
# Detect the optimal number of clusters for k-means clustering using the elbow method
inertia = []
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, random_state=42).fit(penguins_preprocessed)
    inertia.append(kmeans.inertia_)    
plt.plot(range(1, 10), inertia, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
[Figure: Elbow Method — inertia vs. number of clusters]
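As an optional cross-check on the elbow plot (a sketch, not run as part of this notebook), silhouette scores can be computed over the same range of k; this assumes the penguins_preprocessed DataFrame and the imports from the cells above:

In [ ]:
# Sketch: silhouette scores as a second opinion on the choice of k
from sklearn.metrics import silhouette_score

for k in range(2, 10):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(penguins_preprocessed)
    score = silhouette_score(penguins_preprocessed, km.labels_)
    print(f"k={k}: silhouette score = {score:.3f}")

Higher silhouette values indicate more compact, better-separated clusters, so this can be weighed against the elbow plot when picking k.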
In [11]:
n_clusters = 4
In [17]:
# Run the k-means clustering with k=4 
kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(penguins_preprocessed)
penguins_df['label'] = kmeans.labels_

# visualize the clusters (here for the 'culmen_length_mm' column)
plt.scatter(penguins_df['label'], penguins_df['culmen_length_mm'], c=kmeans.labels_, cmap='viridis')
plt.xlabel('Cluster')
plt.ylabel('culmen_length_mm')
plt.xticks(range(int(penguins_df['label'].min()), int(penguins_df['label'].max()) + 1))
plt.title(f'K-means Clustering (K={n_clusters})')
plt.show()
[Figure: Scatter plot of culmen_length_mm by cluster label for K=4]
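Because K-means was fit on standardized features, the cluster centers are easier to interpret after mapping them back to the original units with the fitted scaler. This is a small sketch (not part of the original run) that assumes the scaler and kmeans objects from the cells above:

In [ ]:
# Sketch: express the k-means cluster centers in the original feature units
centers_original = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=penguins_preprocessed.columns,
)
# For the one-hot sex_* columns, the inverse-transformed center is simply the
# proportion of penguins of that sex within the cluster
centers_original.round(2)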
In [14]:
# Create the `stat_penguins` DataFrame: per-cluster means of the measurement columns
numeric_columns = ['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'label']
stat_penguins = penguins_df[numeric_columns].groupby('label').mean()
stat_penguins
Out[14]:
culmen_length_mm culmen_depth_mm flipper_length_mm
label
0 43.878302 19.111321 194.764151
1 45.563793 14.237931 212.706897
2 40.217757 17.611215 189.046729
3 49.473770 15.718033 221.540984
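Note that the summary above leaves out body_mass_g; if useful to the research team, it can be added to the same per-cluster aggregation (an optional extension, not part of the original output):

In [ ]:
# Sketch: per-cluster means including body mass
cols_with_mass = ['culmen_length_mm', 'culmen_depth_mm',
                  'flipper_length_mm', 'body_mass_g', 'label']
penguins_df[cols_with_mass].groupby('label').mean().round(1)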

Final Thoughts:¶

The clusters appear to be well separated, suggesting that K-means has identified distinct groups. However, clusters 0, 1, and 2 have overlapping culmen lengths, which may indicate that K=4 is slightly high, or that another clustering method (e.g., DBSCAN or hierarchical clustering) might yield better results; a quick comparison is sketched below.
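One way to probe that suggestion is to run an alternative algorithm on the same preprocessed data and check how closely its labels agree with the K-means labels. The sketch below (not executed here) uses agglomerative hierarchical clustering and the adjusted Rand index, and assumes the objects defined in the earlier cells:

In [ ]:
# Sketch: compare k-means labels with hierarchical clustering labels
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

agg_labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(penguins_preprocessed)
print("Adjusted Rand index (k-means vs. hierarchical):",
      round(adjusted_rand_score(kmeans.labels_, agg_labels), 3))

An adjusted Rand index close to 1 would mean the two methods largely agree on the grouping; a low value would support revisiting K=4 or trying a different method.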

Cluster 3 (Yellow):¶

Based on the summary table, this cluster has the highest average culmen length (and the longest flippers), meaning these penguins share a distinct trait that differentiates them from the rest.